Setup instructions

Install the Anaconda Python distribution

If you are using your own computer, please install the Anaconda Python distribution from https://www.anaconda.com/download/. (Note that Python versions earlier than 3.0 differ considerably from more recent releases. For this workshop you will need version 3.4 or later.)

Accepting the defaults proposed by the Anaconda installer is generally recommended.

Workshop notes

The class notes for this workshop are available on our website at dss.iq.harvard.edu under Workshop Materials => Python Workshop Materials => Python Web Scraping. Click the All workshop materials link to download the workshop materials.

Extract the PythonWebScraping.zip archive (Right-click => Extract All on Windows, double-click on Mac).

Start the Jupyter Notebook application and open the Exercises.ipynb file in the PythonWebScraping folder you downloaded previously. You may also wish to start a new notebook for your own notes.

Workshop goals and approach

In this workshop you will

  • learn basic web scraping principles and techniques,
  • learn how to use the requests package in Python,
  • practice making requests and manipulating responses from the server.

This workshop is relatively informal, example-oriented, and hands-on. We will learn by working through an example web scraping project.

Note that this is not an introductory workshop. Familiarity with Python, including but not limited to knowledge of lists and dictionaries, indexing, and loops and/or comprehensions, is assumed. If you need an introduction to Python or a refresher, we recommend the IQSS Introduction to Python.

Note also that this workshop will not teach you everything you need to know in order to retrieve data from any web service you might wish to scrape. You can expect to learn just enough to be dangerous.

Preliminary questions

What is web scraping?

Web scraping is the activity of automating retrieval of information from a web service designed for human interaction.

Example project overview and goals

In this workshop I will demonstrate web scraping techniques using the Collections page at https://www.harvardartmuseums.org/collections, and then let you use the skills you learn to retrieve information from other parts of the Harvard Art Museums website.

The basic strategy is pretty much the same for most scraping projects. We will use our web browser (Chrome or Firefox recommended) to examine the page we wish to retrieve data from, and copy/paste information from the browser into our scraping program.

Take shortcuts if you can

We wish to extract information from https://www.harvardartmuseums.org/collections. As with most modern web pages, a lot goes on behind the scenes to produce the page we see in our browser. Our goal is to pull back the curtain to see what the website does when we interact with it. Once we see how the website works we can start retrieving data from it. If we are lucky we’ll find a resource that returns the data we’re looking for in a structured format like JSON or XML.

Examining the structure of our target web service

We start by opening the collections web page in a web browser and inspecting it.

If we scroll down to the bottom of the Collections page, we’ll see a button that says “Load More”. Let’s see what happens when we click on that button. To do so, click on “Network” in the developer tools window, then click the “Load More” button. You should see a list of requests that were made as a result of clicking that button, as shown below.

If we look at that second request, the one to a script named browse, we’ll see that it returns all the information we need, in a convenient format called JSON. All we need to do to retrieve collection data is make GET requests to https://www.harvardartmuseums.org/browse with the correct parameters.

Making requests using Python

The URL we want to retrieve data from has the following structure

scheme  domain                     path    parameters
https   www.harvardartmuseums.org  browse  load_amount=10&offset=0

It is often convenient to create variables containing the domain(s) and path(s) you’ll be working with, as this allows you to swap out paths and parameters as needed. Note that the path is separated from the domain with / and the parameters are separated from the path with ?. If there are multiple parameters they are separated from each other with a &.
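
The separators described above can be checked with Python's standard library. The following is a minimal sketch, assuming the full example URL is assembled from the table above; the variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

# Full example URL assembled from the scheme, domain, path, and parameters above
example_url = "https://www.harvardartmuseums.org/browse?load_amount=10&offset=0"

# urlsplit breaks the URL apart at the "://", "/", and "?" separators
parts = urlsplit(example_url)
print(parts.scheme)  # the scheme
print(parts.netloc)  # the domain
print(parts.path)    # the path, including its leading "/"
print(parts.query)   # the raw parameter string

# parse_qs splits the parameter string at each "&" and "="
params = parse_qs(parts.query)
print(params)
```

Note that parse_qs returns each value as a list of strings, because a parameter name may legitimately appear more than once in a URL.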

For example, we can define the domain and path of the collections URL as follows:
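
A minimal sketch of such a definition is given here. The variable names museum_domain, collection_path, and collection_url are assumptions chosen for illustration.

```python
# Domain (including the scheme) and path kept in separate variables
museum_domain = "https://www.harvardartmuseums.org"
collection_path = "browse"

# The path is separated from the domain with "/"
collection_url = museum_domain + "/" + collection_path
print(collection_url)
```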

## 'https://www.harvardartmuseums.org/browse'

Note that we omit the parameters here because it is usually easier to pass them as a dict when using the requests library in Python. This will become clearer shortly.
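
To see what passing parameters as a dict amounts to, we can ask requests to build a request without sending it. This is a sketch under assumptions: the names collection_url and query_params are illustrative, and the parameter values are the ones from the URL structure shown earlier.

```python
import requests

collection_url = "https://www.harvardartmuseums.org/browse"

# Parameters as a dict instead of a hand-written "?load_amount=10&offset=0"
query_params = {"load_amount": 10, "offset": 0}

# Request(...).prepare() encodes the parameters into the URL
# but does not contact the server
prepared = requests.Request("GET", collection_url, params=query_params).prepare()
print(prepared.url)
```

Keeping the parameters in a dict makes it easy to change a single value, such as the offset, without rebuilding the URL string by hand.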

Now that we’ve constructed the URL we wish to interact with, we’re ready to make our first request in Python.
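
A first request might look roughly like the sketch below. The helper fetch_collections, its get argument, and the name collections1 are hypothetical and not taken from the workshop materials; the timeout value is an arbitrary choice. The actual call is left commented out because it needs network access and the endpoint may have changed since these notes were written.

```python
import requests


def fetch_collections(url, params, get=requests.get):
    """Send a GET request and return the response, raising on HTTP errors."""
    # A timeout keeps the call from hanging forever if the server is slow
    response = get(url, params=params, timeout=10)
    # Turn 4xx/5xx status codes into exceptions instead of failing silently
    response.raise_for_status()
    return response


# Example call (requires network access):
# collections1 = fetch_collections(
#     "https://www.harvardartmuseums.org/browse",
#     {"load_amount": 10, "offset": 0},
# )
# print(collections1.status_code)
```

The get argument defaults to requests.get; accepting it as a parameter simply makes the helper easy to try out with a stand-in function when no network is available.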

Parsing JSON data

We already know from inspecting network traffic in our web browser that this URL returns JSON, but we can use Python to verify this assumption.
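
One way to check is to look at the Content-Type header of the response. The helper below is a hypothetical sketch, not part of the workshop materials; it takes any dict-like headers object, so with requests we could pass it response.headers. Plain dicts stand in for real responses here.

```python
def is_json_content_type(headers):
    """Return True if the Content-Type header declares a JSON media type."""
    content_type = headers.get("Content-Type", "")
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


# Stand-in header dicts for illustration
print(is_json_content_type({"Content-Type": "application/json; charset=utf-8"}))
print(is_json_content_type({"Content-Type": "text/html"}))
```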

Since JSON is a structured data format, parsing it into Python data structures is easy. In fact, there’s a method for that!
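
With requests, the json() method of a response object does this parsing for us. The sketch below shows the same conversion using the standard json module on a heavily trimmed excerpt of the response printed in these notes, so it runs without network access. The variable names are illustrative.

```python
import json

# A trimmed excerpt of the JSON text returned by the server
sample_text = """
{"info": {"page": 1, "pages": 23306, "totalrecords": 233060,
          "totalrecordsperquery": 10},
 "records": [{"title": "The Temple of the Sibyl at Tivoli",
              "classification": "Prints",
              "commentary": null}]}
"""

# JSON objects become dicts, arrays become lists, and null becomes None
parsed = json.loads(sample_text)
print(type(parsed))
print(parsed["info"]["pages"])
print(parsed["records"][0]["title"])
print(parsed["records"][0]["commentary"])
```

Once parsed, the data can be navigated with ordinary dict keys and list indexes, as in the last three lines.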

## {'info': {'next': 'https://api.harvardartmuseums.org/object?apikey=67d9edc0-e6a3-11e3-9798-57275476509a&sort=rank&sortorder=asc&from=0&size=10&page=2',
##           'page': 1,
##           'pages': 23306,
##           'totalrecords': 233060,
##           'totalrecordsperquery': 10},
##  'records': [{'accessionmethod': 'Purchase',
##               'accessionyear': 2000,
##               'accesslevel': 1,
##               'century': '17th century',
##               'classification': 'Prints',
##               'classificationid': 23,
##               'colorcount': 7,
##               'colors': [{'color': '#c8af96',
##                           'css3': '#d2b48c',
##                           'hue': 'Brown',
##                           'percent': 0.26919540229885,
##                           'spectrum': '#e66c64'},
##                          {'color': '#644b32',
##                           'css3': '#556b2f',
##                           'hue': 'Brown',
##                           'percent': 0.1632183908046,
##                           'spectrum': '#59ba4a'},
##                          {'color': '#c8c8af',
##                           'css3': '#c0c0c0',
##                           'hue': 'Green',
##                           'percent': 0.14942528735632,
##                           'spectrum': '#b55592'},
##                          {'color': '#af967d',
##                           'css3': '#bc8f8f',
##                           'hue': 'Brown',
##                           'percent': 0.13298850574713,
##                           'spectrum': '#c25687'},
##                          {'color': '#7d644b',
##                           'css3': '#696969',
##                           'hue': 'Yellow',
##                           'percent': 0.12551724137931,
##                           'spectrum': '#b25593'},
##                          {'color': '#967d64',
##                           'css3': '#808080',
##                           'hue': 'Brown',
##                           'percent': 0.10793103448276,
##                           'spectrum': '#b65590'},
##                          {'color': '#323219',
##                           'css3': '#2f4f4f',
##                           'hue': 'Brown',
##                           'percent': 0.051724137931034,
##                           'spectrum': '#3db657'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, '
##                             'Light-Outerbridge Collection, Richard Norton '
##                             'Memorial Fund',
##               'culture': 'Dutch',
##               'datebegin': 1615,
##               'dated': '1615',
##               'dateend': 1615,
##               'dateoffirstpageview': '2009-09-17',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Prints',
##               'description': None,
##               'dimensions': 'plate: 11.9 x 31 cm (4 11/16 x 12 3/16 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 2,
##               'groupcount': 1,
##               'id': 197886,
##               'imagecount': 1,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/20424985',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 440,
##                           'idsid': 20424985,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/20424985',
##                           'imageid': 271612,
##                           'publiccaption': None,
##                           'renditionnumber': 'INV162750',
##                           'width': 1024}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:48:46-0500',
##               'markscount': 0,
##               'mediacount': 0,
##               'medium': None,
##               'objectid': 197886,
##               'objectnumber': 'M24774',
##               'people': [{'alphasort': 'Velde, Jan van de II',
##                           'birthplace': 'Rotterdam(?), Netherlands',
##                           'culture': 'Dutch',
##                           'deathplace': 'Enkhuizen, Netherlands',
##                           'displaydate': '1593 - 1641',
##                           'displayname': 'Jan van de Velde II',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Jan van de Velde II',
##                           'personid': 29190,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/20424985',
##               'provenance': None,
##               'publicationcount': 0,
##               'rank': 1,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/197886',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': None,
##               'standardreferencenumber': 'H. 180, Fr.-v.d.K. 219',
##               'state': 'ii/ii',
##               'style': None,
##               'technique': 'Etching',
##               'techniqueid': 116,
##               'title': 'The Temple of the Sibyl at Tivoli',
##               'titlescount': 2,
##               'totalpageviews': 420,
##               'totaluniquepageviews': 360,
##               'url': 'https://www.harvardartmuseums.org/collections/object/197886',
##               'verificationlevel': 3,
##               'verificationleveldescription': 'Good. Object is well described '
##                                               'and information is vetted',
##               'worktypes': [{'worktype': 'print', 'worktypeid': '278'}]},
##              {'accessionmethod': 'Purchase',
##               'accessionyear': 2008,
##               'accesslevel': 1,
##               'century': '19th century',
##               'classification': 'Paintings',
##               'classificationid': 26,
##               'colorcount': 10,
##               'colors': [{'color': '#fafafa',
##                           'css3': '#fffafa',
##                           'hue': 'White',
##                           'percent': 0.39489096573209,
##                           'spectrum': '#955ba5'},
##                          {'color': '#afafaf',
##                           'css3': '#a9a9a9',
##                           'hue': 'Grey',
##                           'percent': 0.18953271028037,
##                           'spectrum': '#8c5fa8'},
##                          {'color': '#c8af64',
##                           'css3': '#bdb76b',
##                           'hue': 'Yellow',
##                           'percent': 0.13109034267913,
##                           'spectrum': '#b0d136'},
##                          {'color': '#969696',
##                           'css3': '#a9a9a9',
##                           'hue': 'Grey',
##                           'percent': 0.08816199376947,
##                           'spectrum': '#8761aa'},
##                          {'color': '#7d7d7d',
##                           'css3': '#808080',
##                           'hue': 'Grey',
##                           'percent': 0.065981308411215,
##                           'spectrum': '#8362aa'},
##                          {'color': '#646464',
##                           'css3': '#696969',
##                           'hue': 'Grey',
##                           'percent': 0.036635514018692,
##                           'spectrum': '#7866ad'},
##                          {'color': '#e1e1e1',
##                           'css3': '#dcdcdc',
##                           'hue': 'Grey',
##                           'percent': 0.0201246105919,
##                           'spectrum': '#955ba5'},
##                          {'color': '#323232',
##                           'css3': '#2f4f4f',
##                           'hue': 'Grey',
##                           'percent': 0.019127725856698,
##                           'spectrum': '#2eb45d'},
##                          {'color': '#967d4b',
##                           'css3': '#a0522d',
##                           'hue': 'Brown',
##                           'percent': 0.014330218068536,
##                           'spectrum': '#7fc241'},
##                          {'color': '#c8c8c8',
##                           'css3': '#c0c0c0',
##                           'hue': 'Grey',
##                           'percent': 0.014205607476636,
##                           'spectrum': '#8c5fa8'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, Daniel A. '
##                             'Pollack, Class of 1960, American Art Acquisition '
##                             'Fund',
##               'culture': 'American',
##               'datebegin': 1810,
##               'dated': 'c. 1815',
##               'dateend': 1820,
##               'dateoffirstpageview': '2009-09-19',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of American Paintings, Sculpture & '
##                             'Decorative Arts',
##               'description': None,
##               'dimensions': 'framed: 8.9 x 7.6 cm (3 1/2 x 3 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 1,
##               'groupcount': 2,
##               'id': 323719,
##               'imagecount': 2,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/21079824',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 1024,
##                           'idsid': 21079824,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/21079824',
##                           'imageid': 400116,
##                           'publiccaption': None,
##                           'renditionnumber': 'DDC111346',
##                           'width': 735},
##                          {'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/20487591',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 2,
##                           'format': 'image/jpeg',
##                           'height': 1024,
##                           'idsid': 20487591,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/20487591',
##                           'imageid': 323266,
##                           'publiccaption': None,
##                           'renditionnumber': 'INV203168',
##                           'width': 753}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:50:23-0500',
##               'markscount': 0,
##               'mediacount': 0,
##               'medium': 'Watercolor on ivory',
##               'objectid': 323719,
##               'objectnumber': '2008.35',
##               'people': [{'alphasort': 'Dickinson, Anson',
##                           'birthplace': 'Milton, CT',
##                           'culture': 'American',
##                           'deathplace': 'Milton, CT',
##                           'displaydate': '1779 - 1852',
##                           'displayname': 'Anson Dickinson',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Anson Dickinson',
##                           'personid': 53243,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/21079824',
##               'provenance': "Sold at Christie's March 20, 1990, lot 112; "
##                             'purchased by private collector; Fogg purchase at '
##                             "Bonham's London, November 21, 2007, lot 308.",
##               'publicationcount': 1,
##               'rank': 2,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/323719',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': 'engraved verso: Anson Dickinson/pinxit',
##               'standardreferencenumber': None,
##               'state': None,
##               'style': None,
##               'technique': None,
##               'techniqueid': None,
##               'title': 'A Young Lady',
##               'titlescount': 1,
##               'totalpageviews': 462,
##               'totaluniquepageviews': 376,
##               'url': 'https://www.harvardartmuseums.org/collections/object/323719',
##               'verificationlevel': 4,
##               'verificationleveldescription': 'Best. Object is extensively '
##                                               'researched, well described and '
##                                               'information is vetted',
##               'worktypes': [{'worktype': 'painting', 'worktypeid': '242'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 2000,
##               'accesslevel': 1,
##               'century': '19th century',
##               'classification': 'Prints',
##               'classificationid': 23,
##               'colorcount': 8,
##               'colors': [{'color': '#af7d4b',
##                           'css3': '#cd853f',
##                           'hue': 'Yellow',
##                           'percent': 0.45495327102804,
##                           'spectrum': '#e9715f'},
##                          {'color': '#af9664',
##                           'css3': '#bdb76b',
##                           'hue': 'Yellow',
##                           'percent': 0.27314641744548,
##                           'spectrum': '#e9715f'},
##                          {'color': '#96644b',
##                           'css3': '#a0522d',
##                           'hue': 'Brown',
##                           'percent': 0.12629283489097,
##                           'spectrum': '#c25687'},
##                          {'color': '#c8c8c8',
##                           'css3': '#c0c0c0',
##                           'hue': 'Grey',
##                           'percent': 0.06417445482866,
##                           'spectrum': '#8c5fa8'},
##                          {'color': '#7d4b32',
##                           'css3': '#8b4513',
##                           'hue': 'Brown',
##                           'percent': 0.039439252336449,
##                           'spectrum': '#c25687'},
##                          {'color': '#c8af96',
##                           'css3': '#d2b48c',
##                           'hue': 'Brown',
##                           'percent': 0.024797507788162,
##                           'spectrum': '#e66c64'},
##                          {'color': '#afafaf',
##                           'css3': '#a9a9a9',
##                           'hue': 'Grey',
##                           'percent': 0.014766355140187,
##                           'spectrum': '#8c5fa8'},
##                          {'color': '#e1e1e1',
##                           'css3': '#dcdcdc',
##                           'hue': 'Grey',
##                           'percent': 0.0024299065420561,
##                           'spectrum': '#955ba5'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, Gift of the '
##                             'Woodner Family Collection, Inc.',
##               'culture': 'French',
##               'datebegin': 1894,
##               'dated': '1894',
##               'dateend': 1894,
##               'dateoffirstpageview': '2009-05-14',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Prints',
##               'description': None,
##               'dimensions': 'sheet: 29.3 x 20.7 cm (11 9/16 x 8 1/8 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 2,
##               'groupcount': 2,
##               'id': 186076,
##               'imagecount': 2,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17917275',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 2550,
##                           'idsid': 17917275,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/17917275',
##                           'imageid': 391788,
##                           'publiccaption': None,
##                           'renditionnumber': 'DDC111041',
##                           'width': 1833},
##                          {'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17358148',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 2,
##                           'format': 'image/jpeg',
##                           'height': 2550,
##                           'idsid': 17358148,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/17358148',
##                           'imageid': 42269,
##                           'publiccaption': None,
##                           'renditionnumber': '50749',
##                           'width': 1816}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:48:38-0500',
##               'markscount': 1,
##               'mediacount': 0,
##               'medium': 'Monotype, watercolor and oil, on tan Asian paper\r\n'
##                         '\r\n',
##               'objectid': 186076,
##               'objectnumber': 'M24382',
##               'people': [{'alphasort': 'Gauguin, Paul',
##                           'birthplace': 'Paris',
##                           'culture': 'French',
##                           'deathplace': 'Fatu-Iwa [Marquesas Islands]',
##                           'displaydate': '1848 - 1903',
##                           'displayname': 'Paul Gauguin',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Paul Gauguin',
##                           'personid': 21674,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17917275',
##               'provenance': None,
##               'publicationcount': 1,
##               'rank': 3,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/186076',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': None,
##               'standardreferencenumber': 'Field 31',
##               'state': None,
##               'style': None,
##               'technique': 'Monotype',
##               'techniqueid': 881,
##               'title': 'Oviri',
##               'titlescount': 1,
##               'totalpageviews': 476,
##               'totaluniquepageviews': 412,
##               'url': 'https://www.harvardartmuseums.org/collections/object/186076',
##               'verificationlevel': 3,
##               'verificationleveldescription': 'Good. Object is well described '
##                                               'and information is vetted',
##               'worktypes': [{'worktype': 'print', 'worktypeid': '278'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 1978,
##               'accesslevel': 1,
##               'century': '18th century',
##               'classification': 'Drawings',
##               'classificationid': 21,
##               'colorcount': 6,
##               'colors': [{'color': '#e1c8af',
##                           'css3': '#f5deb3',
##                           'hue': 'Orange',
##                           'percent': 0.25892655367232,
##                           'spectrum': '#e9715f'},
##                          {'color': '#c8af96',
##                           'css3': '#d2b48c',
##                           'hue': 'Brown',
##                           'percent': 0.23152542372881,
##                           'spectrum': '#e66c64'},
##                          {'color': '#af967d',
##                           'css3': '#bc8f8f',
##                           'hue': 'Brown',
##                           'percent': 0.21333333333333,
##                           'spectrum': '#c25687'},
##                          {'color': '#af7d64',
##                           'css3': '#cd5c5c',
##                           'hue': 'Brown',
##                           'percent': 0.18536723163842,
##                           'spectrum': '#c85783'},
##                          {'color': '#96644b',
##                           'css3': '#a0522d',
##                           'hue': 'Brown',
##                           'percent': 0.084689265536723,
##                           'spectrum': '#c25687'},
##                          {'color': '#e1e1c8',
##                           'css3': '#dcdcdc',
##                           'hue': 'Green',
##                           'percent': 0.026158192090395,
##                           'spectrum': '#e9715f'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, Gift of Therese '
##                             'Kuhn Straus in memory of her husband, Herbert N. '
##                             'Straus, Harvard Class of 1903',
##               'culture': 'French',
##               'datebegin': 1759,
##               'dated': 'c. 1764',
##               'dateend': 1769,
##               'dateoffirstpageview': '2009-08-01',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Drawings',
##               'description': None,
##               'dimensions': '29.2 x 37.2 cm (11 1/2 x 14 5/8 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 0,
##               'groupcount': 2,
##               'id': 295813,
##               'imagecount': 1,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17388899',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 2018,
##                           'idsid': 17388899,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/17388899',
##                           'imageid': 54292,
##                           'publiccaption': None,
##                           'renditionnumber': '61234',
##                           'width': 2550}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:50:03-0500',
##               'markscount': 2,
##               'mediacount': 0,
##               'medium': 'Red chalk on cream antique laid paper, laid down on a '
##                         'decorated mount',
##               'objectid': 295813,
##               'objectnumber': '1978.51',
##               'people': [{'alphasort': 'Robert, Hubert',
##                           'birthplace': 'Paris',
##                           'culture': 'French',
##                           'deathplace': 'Paris',
##                           'displaydate': '1733 - 1808',
##                           'displayname': 'Hubert Robert',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Hubert Robert',
##                           'personid': 28325,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17388899',
##               'provenance': 'Therese Kuhn Straus, New York, gift; to Harvard '
##                             'Art Museums/Fogg Museum, Gift of Therese Kuhn '
##                             'Straus in memory of her husband, Herbert N. '
##                             'Straus, Harvard Class of 1903, 1978.51',
##               'publicationcount': 1,
##               'rank': 4,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/295813',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': None,
##               'standardreferencenumber': None,
##               'state': None,
##               'style': None,
##               'technique': None,
##               'techniqueid': None,
##               'title': 'Fountain in the Garden of an Italian Villa',
##               'titlescount': 2,
##               'totalpageviews': 488,
##               'totaluniquepageviews': 409,
##               'url': 'https://www.harvardartmuseums.org/collections/object/295813',
##               'verificationlevel': 4,
##               'verificationleveldescription': 'Best. Object is extensively '
##                                               'researched, well described and '
##                                               'information is vetted',
##               'worktypes': [{'worktype': 'drawing', 'worktypeid': '125'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 2002,
##               'accesslevel': 1,
##               'century': '4th century BCE',
##               'classification': 'Coins',
##               'classificationid': 50,
##               'colorcount': 0,
##               'commentary': None,
##               'contact': 'am_asianmediterranean@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Arthur M. Sackler Museum, '
##                             'Gift of Cornelius C. Vermeule, III',
##               'culture': 'Greek',
##               'datebegin': -319,
##               'dated': '319 BCE-315 BCE',
##               'dateend': -315,
##               'dateoffirstpageview': '2009-05-25',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Ancient and Byzantine Art & '
##                             'Numismatics',
##               'description': 'Obv.: Head of young Heracles r. wearing lion '
##                              'skin.\r\n'
##                              'Rev.: Zeus seated l. on throne holding eagle and '
##                              'scepter; to l., G [Greek gamma]; below throne, '
##                              'A; to r., amphora; in exergue, ivy-leaf.',
##               'details': {'coins': {'dateonobject': None,
##                                     'denomination': 'tetradrachm',
##                                     'dieaxis': '2',
##                                     'metal': 'AR',
##                                     'obverseinscription': None,
##                                     'reverseinscription': 'G A'}},
##               'dimensions': '17.11 g',
##               'division': 'Asian and Mediterranean Art',
##               'edition': None,
##               'exhibitioncount': 1,
##               'groupcount': 1,
##               'id': 146994,
##               'imagecount': 1,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/18779226',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 1188,
##                           'idsid': 18779226,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/18779226',
##                           'imageid': 38762,
##                           'publiccaption': None,
##                           'renditionnumber': 'COIN01761',
##                           'width': 2550}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:48:09-0500',
##               'markscount': 0,
##               'mediacount': 0,
##               'medium': 'Silver',
##               'objectid': 146994,
##               'objectnumber': '2002.34',
##               'people': [{'alphasort': 'Alexander III, the Great',
##                           'birthplace': None,
##                           'culture': 'Greek',
##                           'deathplace': 'Babylon',
##                           'displaydate': 'r. 336-323 BCE',
##                           'displayname': 'Alexander III, the Great',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Alexander III, the Great',
##                           'personid': 19560,
##                           'prefix': None,
##                           'role': 'Coin Constituent'}],
##               'peoplecount': 1,
##               'period': 'Hellenistic period, Early',
##               'periodid': 352,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/18779226',
##               'provenance': 'CNG Sale 54, 14 June 2000, no.472.',
##               'publicationcount': 0,
##               'rank': 5,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/146994',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': None,
##               'standardreferencenumber': 'Price 2649',
##               'state': None,
##               'style': None,
##               'technique': 'Struck',
##               'techniqueid': 7320,
##               'title': 'Tetradrachm of Alexander the Great, Sardis',
##               'titlescount': 1,
##               'totalpageviews': 560,
##               'totaluniquepageviews': 461,
##               'url': 'https://www.harvardartmuseums.org/collections/object/146994',
##               'verificationlevel': 3,
##               'verificationleveldescription': 'Good. Object is well described '
##                                               'and information is vetted',
##               'worktypes': [{'worktype': 'coin', 'worktypeid': '100'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 1898,
##               'accesslevel': 1,
##               'century': '18th century',
##               'classification': 'Drawings',
##               'classificationid': 21,
##               'colorcount': 0,
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, Gift of Belinda '
##                             'L. Randall from the collection of John Witt '
##                             'Randall',
##               'culture': 'Swiss',
##               'datebegin': 1783,
##               'dated': 'c. 1788',
##               'dateend': 1793,
##               'dateoffirstpageview': '2009-08-04',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Drawings',
##               'description': None,
##               'dimensions': '48.9 x 64.5 cm (19 1/4 x 25 3/8 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 6,
##               'groupcount': 1,
##               'id': 300012,
##               'imagecount': 3,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/43164494',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 771,
##                           'idsid': 43164494,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/43164494',
##                           'imageid': 80186,
##                           'publiccaption': None,
##                           'renditionnumber': 'LEG840',
##                           'width': 1024},
##                          {'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/20673582',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 2,
##                           'format': 'image/jpeg',
##                           'height': 1909,
##                           'idsid': 20673582,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/20673582',
##                           'imageid': 25769,
##                           'publiccaption': None,
##                           'renditionnumber': '30001',
##                           'width': 2550},
##                          {'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/20674539',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 3,
##                           'format': 'image/jpeg',
##                           'height': 823,
##                           'idsid': 20674539,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/20674539',
##                           'imageid': 8720,
##                           'publiccaption': None,
##                           'renditionnumber': '33339',
##                           'width': 1024}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:50:06-0500',
##               'markscount': 2,
##               'mediacount': 0,
##               'medium': 'Black ink and gray wash over graphite on off-white '
##                         'antique laid paper',
##               'objectid': 300012,
##               'objectnumber': '1898.115',
##               'people': [{'alphasort': 'Zingg, Adrian',
##                           'birthplace': 'St. Gall, Switzerland',
##                           'culture': 'Swiss',
##                           'deathplace': 'Leipzig, Germany',
##                           'displaydate': '1734 - 1816',
##                           'displayname': 'Adrian Zingg',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Adrian Zingg',
##                           'personid': 29526,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/43164494',
##               'provenance': 'Carl August Richter, Dresden; John Witt Randall, '
##                             'bequest; to Belinda Lull Randall, his sister, '
##                             '1892, gift; to Fogg Art Museum, 1898.',
##               'publicationcount': 6,
##               'rank': 6,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/300012',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': None,
##               'standardreferencenumber': None,
##               'state': None,
##               'style': None,
##               'technique': None,
##               'techniqueid': None,
##               'title': 'View of Dresden',
##               'titlescount': 1,
##               'totalpageviews': 571,
##               'totaluniquepageviews': 438,
##               'url': 'https://www.harvardartmuseums.org/collections/object/300012',
##               'verificationlevel': 3,
##               'verificationleveldescription': 'Good. Object is well described '
##                                               'and information is vetted',
##               'worktypes': [{'worktype': 'drawing', 'worktypeid': '125'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 1941,
##               'accesslevel': 1,
##               'century': '19th century',
##               'classification': 'Drawings',
##               'classificationid': 21,
##               'colorcount': 8,
##               'colors': [{'color': '#967d64',
##                           'css3': '#808080',
##                           'hue': 'Brown',
##                           'percent': 0.43264957264957,
##                           'spectrum': '#b65590'},
##                          {'color': '#7d644b',
##                           'css3': '#696969',
##                           'hue': 'Yellow',
##                           'percent': 0.34626780626781,
##                           'spectrum': '#b25593'},
##                          {'color': '#644b32',
##                           'css3': '#556b2f',
##                           'hue': 'Brown',
##                           'percent': 0.096638176638177,
##                           'spectrum': '#59ba4a'},
##                          {'color': '#c8af96',
##                           'css3': '#d2b48c',
##                           'hue': 'Brown',
##                           'percent': 0.042165242165242,
##                           'spectrum': '#e66c64'},
##                          {'color': '#4b4b4b',
##                           'css3': '#2f4f4f',
##                           'hue': 'Grey',
##                           'percent': 0.034871794871795,
##                           'spectrum': '#3db657'},
##                          {'color': '#c8c8af',
##                           'css3': '#c0c0c0',
##                           'hue': 'Green',
##                           'percent': 0.031566951566952,
##                           'spectrum': '#b55592'},
##                          {'color': '#e1e1c8',
##                           'css3': '#dcdcdc',
##                           'hue': 'Green',
##                           'percent': 0.012877492877493,
##                           'spectrum': '#e9715f'},
##                          {'color': '#323232',
##                           'css3': '#2f4f4f',
##                           'hue': 'Grey',
##                           'percent': 0.002962962962963,
##                           'spectrum': '#2eb45d'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, Gift of '
##                             'Grenville L. Winthrop, Class of 1886',
##               'culture': 'American',
##               'datebegin': 1883,
##               'dated': 'c. 1888',
##               'dateend': 1893,
##               'dateoffirstpageview': '2009-09-07',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Drawings',
##               'description': None,
##               'dimensions': '24.2 x 18.7 cm (9 1/2 x 7 3/8 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 0,
##               'groupcount': 2,
##               'id': 307703,
##               'imagecount': 1,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17804789',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 1024,
##                           'idsid': 17804789,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/17804789',
##                           'imageid': 132173,
##                           'publiccaption': None,
##                           'renditionnumber': '62187',
##                           'width': 800}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:50:13-0500',
##               'markscount': 4,
##               'mediacount': 0,
##               'medium': 'Gouache, gold leaf, and black ink on brown card',
##               'objectid': 307703,
##               'objectnumber': '1941.76',
##               'people': [{'alphasort': 'Vedder, Elihu',
##                           'birthplace': 'New York, NY',
##                           'culture': 'American',
##                           'deathplace': 'Rome, Italy',
##                           'displaydate': '1836 - 1923',
##                           'displayname': 'Elihu Vedder',
##                           'displayorder': 1,
##                           'gender': 'unknown',
##                           'name': 'Elihu Vedder',
##                           'personid': 29183,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17804789',
##               'provenance': 'Sold by the artist to or commissioned by Louis '
##                             'Prang, Boston; his sale, December 6, 1899, lot '
##                             '1170; Grenville L. Winthrop, New York, NY; his '
##                             'gift to Fogg Art Museum, 1941.',
##               'publicationcount': 4,
##               'rank': 7,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/307703',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': 'red-brown gouache, l.r.: Vedder',
##               'standardreferencenumber': None,
##               'state': None,
##               'style': None,
##               'technique': None,
##               'techniqueid': None,
##               'title': "Aladdin's Lamp",
##               'titlescount': 1,
##               'totalpageviews': 595,
##               'totaluniquepageviews': 516,
##               'url': 'https://www.harvardartmuseums.org/collections/object/307703',
##               'verificationlevel': 4,
##               'verificationleveldescription': 'Best. Object is extensively '
##                                               'researched, well described and '
##                                               'information is vetted',
##               'worktypes': [{'worktype': 'drawing', 'worktypeid': '125'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 1978,
##               'accesslevel': 1,
##               'century': '18th-19th century',
##               'classification': 'Sculpture',
##               'classificationid': 30,
##               'colorcount': 10,
##               'colors': [{'color': '#7d7d7d',
##                           'css3': '#808080',
##                           'hue': 'Grey',
##                           'percent': 0.31258426966292,
##                           'spectrum': '#8362aa'},
##                          {'color': '#646464',
##                           'css3': '#696969',
##                           'hue': 'Grey',
##                           'percent': 0.28786516853933,
##                           'spectrum': '#7866ad'},
##                          {'color': '#4b4b4b',
##                           'css3': '#2f4f4f',
##                           'hue': 'Grey',
##                           'percent': 0.074456928838951,
##                           'spectrum': '#3db657'},
##                          {'color': '#c8af96',
##                           'css3': '#d2b48c',
##                           'hue': 'Brown',
##                           'percent': 0.066666666666667,
##                           'spectrum': '#e66c64'},
##                          {'color': '#e1c8af',
##                           'css3': '#f5deb3',
##                           'hue': 'Orange',
##                           'percent': 0.062172284644195,
##                           'spectrum': '#e9715f'},
##                          {'color': '#969696',
##                           'css3': '#a9a9a9',
##                           'hue': 'Grey',
##                           'percent': 0.050037453183521,
##                           'spectrum': '#8761aa'},
##                          {'color': '#af967d',
##                           'css3': '#bc8f8f',
##                           'hue': 'Brown',
##                           'percent': 0.048089887640449,
##                           'spectrum': '#c25687'},
##                          {'color': '#967d64',
##                           'css3': '#808080',
##                           'hue': 'Brown',
##                           'percent': 0.040524344569288,
##                           'spectrum': '#b65590'},
##                          {'color': '#7d644b',
##                           'css3': '#696969',
##                           'hue': 'Yellow',
##                           'percent': 0.026292134831461,
##                           'spectrum': '#b25593'},
##                          {'color': '#fae1c8',
##                           'css3': '#ffe4c4',
##                           'hue': 'Yellow',
##                           'percent': 0.01310861423221,
##                           'spectrum': '#ed765a'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, Gift of Therese '
##                             'Kuhn Straus in memory of her husband, Herbert N. '
##                             'Straus, Harvard Class of 1903',
##               'culture': 'French',
##               'datebegin': 1790,
##               'dated': '1790-1800',
##               'dateend': 1800,
##               'dateoffirstpageview': '2010-09-03',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Paintings, Sculpture & Decorative '
##                             'Arts',
##               'description': None,
##               'dimensions': '41.8 x 16 x 18.2 cm (16 7/16 x 6 5/16 x 7 3/16 '
##                             'in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 4,
##               'groupcount': 2,
##               'id': 227915,
##               'imagecount': 2,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/18189190',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 1024,
##                           'idsid': 18189190,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/18189190',
##                           'imageid': 391877,
##                           'publiccaption': None,
##                           'renditionnumber': 'DDC111121',
##                           'width': 470},
##                          {'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/20670381',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 2,
##                           'format': 'image/jpeg',
##                           'height': 1024,
##                           'idsid': 20670381,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/20670381',
##                           'imageid': 3300,
##                           'publiccaption': None,
##                           'renditionnumber': '14557',
##                           'width': 822}],
##               'labeltext': 'Though never elected to full membership in the '
##                            'Academy, Clodion was one of the most esteemed '
##                            'sculptors of the eighteenth century, and during '
##                            'his first residence in Rome he shared a studio '
##                            'with Jean-Antoine Houdon. Clodion worked '
##                            'frequently with terracotta, or baked clay, a '
##                            'material that was traditionally used by sculptors '
##                            'to explore ideas for larger projects in more '
##                            'expensive materials; it was highly prized by '
##                            'collectors around the middle of the eighteenth '
##                            'century for its sketchy, provisional qualities. '
##                            'Clodion responded to the growing market for '
##                            'terracotta sculpture by producing numerous '
##                            'small-scale works such as these. Toward the end of '
##                            'the century, he favored groups with inverted '
##                            'poses, such as these two figures carrying putti on '
##                            'opposite shoulders and turning in opposite '
##                            'directions, and he embraced the neoclassical '
##                            'style, basing the figures’ drapery and hairstyles '
##                            'on ancient models.',
##               'lastupdate': '2019-01-22T03:49:09-0500',
##               'markscount': 2,
##               'mediacount': 0,
##               'medium': 'Terracotta',
##               'objectid': 227915,
##               'objectnumber': '1978.40',
##               'people': [{'alphasort': 'Clodion, Claude Michel, called',
##                           'birthplace': 'Nancy, France',
##                           'culture': 'French',
##                           'deathplace': 'Paris, France',
##                           'displaydate': '1738 - 1814',
##                           'displayname': 'Claude Michel, called Clodion',
##                           'displayorder': 1,
##                           'gender': 'unknown',
##                           'name': 'Claude Michel, called Clodion',
##                           'personid': 32937,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/18189190',
##               'provenance': 'Comtesse Montesquiou-Fezensac, Paris, sold '
##                             '[through her sale, Hôtel Drouot, Paris. No. 122]. '
##                             'Carlos G. de Candamo, sold [through Galerie '
##                             'Charpentier, no. 71, 1934]. Therese Kuhn (Mrs. '
##                             'Herbert N.) Straus, New York, NY, gift; to the '
##                             'Fogg Art Museum, 1978.',
##               'publicationcount': 3,
##               'rank': 8,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/227915',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': 'incised into top of integral base: CLODION',
##               'standardreferencenumber': None,
##               'state': None,
##               'style': None,
##               'technique': None,
##               'techniqueid': None,
##               'title': 'Young Woman Carrying a Child on Her Left Shoulder',
##               'titlescount': 2,
##               'totalpageviews': 599,
##               'totaluniquepageviews': 520,
##               'url': 'https://www.harvardartmuseums.org/collections/object/227915',
##               'verificationlevel': 4,
##               'verificationleveldescription': 'Best. Object is extensively '
##                                               'researched, well described and '
##                                               'information is vetted',
##               'worktypes': [{'worktype': 'sculpture', 'worktypeid': '317'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 1964,
##               'accesslevel': 1,
##               'century': '18th century',
##               'classification': 'Drawings',
##               'classificationid': 21,
##               'colorcount': 6,
##               'colors': [{'color': '#e1c8af',
##                           'css3': '#f5deb3',
##                           'hue': 'Orange',
##                           'percent': 0.46392156862745,
##                           'spectrum': '#e9715f'},
##                          {'color': '#e1e1c8',
##                           'css3': '#dcdcdc',
##                           'hue': 'Green',
##                           'percent': 0.25831932773109,
##                           'spectrum': '#e9715f'},
##                          {'color': '#c8af96',
##                           'css3': '#d2b48c',
##                           'hue': 'Brown',
##                           'percent': 0.17238095238095,
##                           'spectrum': '#e66c64'},
##                          {'color': '#c8967d',
##                           'css3': '#bc8f8f',
##                           'hue': 'Grey',
##                           'percent': 0.075182072829132,
##                           'spectrum': '#e66c64'},
##                          {'color': '#af7d64',
##                           'css3': '#cd5c5c',
##                           'hue': 'Brown',
##                           'percent': 0.027675070028011,
##                           'spectrum': '#c85783'},
##                          {'color': '#96644b',
##                           'css3': '#a0522d',
##                           'hue': 'Brown',
##                           'percent': 0.0025210084033613,
##                           'spectrum': '#c25687'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'Harvard Art Museums/Fogg Museum, Gift of John S. '
##                             'Newberry',
##               'culture': 'French',
##               'datebegin': 1713,
##               'dated': 'c. 1713',
##               'dateend': 1713,
##               'dateoffirstpageview': '2009-05-16',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Drawings',
##               'description': None,
##               'dimensions': '15.5 x 19.6 cm (6 1/8 x 7 11/16 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 6,
##               'groupcount': 2,
##               'id': 296710,
##               'imagecount': 1,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17357965',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 2021,
##                           'idsid': 17357965,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/17357965',
##                           'imageid': 26374,
##                           'publiccaption': None,
##                           'renditionnumber': '30748',
##                           'width': 2550}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:50:04-0500',
##               'markscount': 2,
##               'mediacount': 0,
##               'medium': 'Red chalk on cream antique laid paper, framing lines '
##                         'in black ink, laid down on off-white card',
##               'objectid': 296710,
##               'objectnumber': '1964.14',
##               'people': [{'alphasort': 'Watteau, Jean-Antoine',
##                           'birthplace': 'Valenciennes',
##                           'culture': 'French',
##                           'deathplace': 'Nogent-sur-Marne',
##                           'displaydate': '1684 - 1721',
##                           'displayname': 'Jean-Antoine Watteau',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Jean-Antoine Watteau',
##                           'personid': 29335,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/17357965',
##               'provenance': 'Jean-Pierre Norblin de la Gourdaine, Paris; to '
##                             'his son, Louis-Pierre-Martin Norblin de la '
##                             'Gourdaine, Paris, by descent; '
##                             'Marie-Élise-Antoinette-Blanche-Francisca, Baronne '
##                             'Bajot de Connantre (or Conantré; née Symonet), '
##                             'Château de Connantre, Connantre; to her daughter, '
##                             'Marie-Blanche-Charlotte, Comtesse des '
##                             'Isnards-Suze (née Bajot de Connantre), Château de '
##                             'Suze-la-Rousse, Suze-la-Rousse, by descent (or to '
##                             'her other daughter, Anne-Blanche-Caroline, '
##                             'Baronne de Rublé, née Bajot de Connantre, Paris '
##                             'and Château de Rublé, Gimat, by descent); to her '
##                             'daughter, Eliane, Baronne de Witte (née des '
##                             'Isnards-Suze), Paris, by descent; to her '
##                             'daughter, Germaine, Marquise de Bryas (née de '
##                             'Witte), Château de Suze-la-Rousse, '
##                             'Suze-la-Rousse, by descent;  Galerie Cailleux, '
##                             'Paris (by 1958), sold; to John S. Newberry, New '
##                             'York, gift; to Harvard Art Museums/Fogg Museum, '
##                             'Gift of John S. Newberry, inv. no. 1964.14',
##               'publicationcount': 24,
##               'rank': 9,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/296710',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': None,
##               'standardreferencenumber': None,
##               'state': None,
##               'style': None,
##               'technique': None,
##               'techniqueid': None,
##               'title': 'Three Views of a Military Drummer',
##               'titlescount': 1,
##               'totalpageviews': 647,
##               'totaluniquepageviews': 545,
##               'url': 'https://www.harvardartmuseums.org/collections/object/296710',
##               'verificationlevel': 4,
##               'verificationleveldescription': 'Best. Object is extensively '
##                                               'researched, well described and '
##                                               'information is vetted',
##               'worktypes': [{'worktype': 'drawing', 'worktypeid': '125'}]},
##              {'accessionmethod': 'Gift',
##               'accessionyear': 1999,
##               'accesslevel': 1,
##               'century': '17th century',
##               'classification': 'Drawings',
##               'classificationid': 21,
##               'colorcount': 10,
##               'colors': [{'color': '#fae1e1',
##                           'css3': '#ffe4e1',
##                           'hue': 'Red',
##                           'percent': 0.70465517241379,
##                           'spectrum': '#e76d63'},
##                          {'color': '#e1c8af',
##                           'css3': '#f5deb3',
##                           'hue': 'Orange',
##                           'percent': 0.075574712643678,
##                           'spectrum': '#e9715f'},
##                          {'color': '#7d7d64',
##                           'css3': '#808080',
##                           'hue': 'Yellow',
##                           'percent': 0.043908045977011,
##                           'spectrum': '#6cbd45'},
##                          {'color': '#96967d',
##                           'css3': '#808080',
##                           'hue': 'Green',
##                           'percent': 0.041896551724138,
##                           'spectrum': '#8e5ea7'},
##                          {'color': '#64644b',
##                           'css3': '#696969',
##                           'hue': 'Green',
##                           'percent': 0.033275862068966,
##                           'spectrum': '#59ba4a'},
##                          {'color': '#afaf96',
##                           'css3': '#a9a9a9',
##                           'hue': 'Green',
##                           'percent': 0.026206896551724,
##                           'spectrum': '#8e5ea7'},
##                          {'color': '#4b4b32',
##                           'css3': '#556b2f',
##                           'hue': 'Green',
##                           'percent': 0.024080459770115,
##                           'spectrum': '#4ab851'},
##                          {'color': '#7d324b',
##                           'css3': '#a52a2a',
##                           'hue': 'Red',
##                           'percent': 0.021666666666667,
##                           'spectrum': '#b25593'},
##                          {'color': '#964b64',
##                           'css3': '#696969',
##                           'hue': 'Red',
##                           'percent': 0.0097126436781609,
##                           'spectrum': '#b65590'},
##                          {'color': '#323232',
##                           'css3': '#2f4f4f',
##                           'hue': 'Grey',
##                           'percent': 0.0089080459770115,
##                           'spectrum': '#2eb45d'}],
##               'commentary': None,
##               'contact': 'am_europeanamerican@harvard.edu',
##               'contextualtextcount': 0,
##               'copyright': None,
##               'creditline': 'The Maida and George Abrams Collection, Fogg Art '
##                             'Museum, Harvard University, Cambridge, '
##                             'Massachusetts',
##               'culture': 'Dutch',
##               'datebegin': 1680,
##               'dated': '1680',
##               'dateend': 1680,
##               'dateoffirstpageview': '2010-02-03',
##               'dateoflastpageview': '2018-12-29',
##               'department': 'Department of Drawings',
##               'description': None,
##               'dimensions': '20 x 15.7 cm (7 7/8 x 6 3/16 in.)',
##               'division': 'European and American Art',
##               'edition': None,
##               'exhibitioncount': 5,
##               'groupcount': 3,
##               'id': 199108,
##               'imagecount': 1,
##               'imagepermissionlevel': 0,
##               'images': [{'baseimageurl': 'https://idscache.harvardartmuseums.org/ids/view/18481425',
##                           'copyright': 'President and Fellows of Harvard '
##                                        'College',
##                           'displayorder': 1,
##                           'format': 'image/jpeg',
##                           'height': 2550,
##                           'idsid': 18481425,
##                           'iiifbaseuri': 'https://ids.lib.harvard.edu/ids/iiif/18481425',
##                           'imageid': 21519,
##                           'publiccaption': None,
##                           'renditionnumber': '71920',
##                           'width': 1999}],
##               'labeltext': None,
##               'lastupdate': '2019-01-22T03:48:47-0500',
##               'markscount': 8,
##               'mediacount': 0,
##               'medium': 'Transparent and opaque watercolor and brown ink over '
##                         'graphite, some glazing in green areas, on cream '
##                         'antique laid paper',
##               'objectid': 199108,
##               'objectnumber': '1999.169',
##               'people': [{'alphasort': 'Saftleven, Herman',
##                           'birthplace': 'Rotterdam, Netherlands',
##                           'culture': 'Dutch',
##                           'deathplace': 'Utrecht, Netherlands',
##                           'displaydate': '1609 - 1685',
##                           'displayname': 'Herman Saftleven',
##                           'displayorder': 1,
##                           'gender': 'male',
##                           'name': 'Herman Saftleven',
##                           'personid': 28454,
##                           'prefix': None,
##                           'role': 'Artist'}],
##               'peoplecount': 1,
##               'period': None,
##               'periodid': None,
##               'primaryimageurl': 'https://idscache.harvardartmuseums.org/ids/view/18481425',
##               'provenance': 'Agneta Block, Vijverhof, The Netherlands. Perhaps '
##                             'Samuel van Huls, The Hague, sold; [Swart, The '
##                             'Hague, 14 May 1736, under portfs. YYY and ZZZ, '
##                             'lot 3882.]  Perhaps Jan Bisschop, sold; [Bosch et '
##                             'al., Rotterdam, 24 June 1771, kb K, under lots '
##                             '70-89]; lots 70-72, 75-78, 81-84, and 89 to H. '
##                             'van den Bergh, lots 73-74, 79-80, 85-86, and 88 '
##                             'to D. de Jongh, lot 87 to Fouquet.  Perhaps heirs '
##                             'of Michiel van den Bergh, Rotterdam, sold; '
##                             '[Holsteyn, Rotterdam, 19 June 1786, kb D, lots 4, '
##                             '14-26]; to Philippie. Or instead perhaps the '
##                             'heirs of Daniel de Jongh, sold; [Van Ryp, '
##                             'Rotterdam, 26 March 1810, kb N, lots 55-61.]  '
##                             'Janet and John E. Marqusee, New York, sold; '
##                             '[Sotheby’s, London, 13 December 1973, lot 155, '
##                             'repr. p. 82]; to [Baskett & Day, London], sold; '
##                             'to Maida and George Abrams, Boston, 1974 (L. '
##                             '3306, verso, lower left); The Maida and George '
##                             'Abrams Collection, 1999.169.\r\n',
##               'publicationcount': 12,
##               'rank': 10,
##               'relatedcount': 0,
##               'seeAlso': [{'format': 'application/json',
##                            'id': 'https://iiif.harvardartmuseums.org/manifests/object/199108',
##                            'profile': 'http://iiif.io/api/presentation/2/context.json',
##                            'type': 'IIIF Manifest'}],
##               'signed': 'Lower left, brown ink, HS. [in ligature] f. 1680',
##               'standardreferencenumber': None,
##               'state': None,
##               'style': None,
##               'technique': None,
##               'techniqueid': None,
##               'title': 'Mullein Pink',
##               'titlescount': 1,
##               'totalpageviews': 661,
##               'totaluniquepageviews': 562,
##               'url': 'https://www.harvardartmuseums.org/collections/object/199108',
##               'verificationlevel': 3,
##               'verificationleveldescription': 'Good. Object is well described '
##                                               'and information is vetted',
##               'worktypes': [{'worktype': 'drawing', 'worktypeid': '125'}]}]}

That’s it. Really, we are done here. Everyone go home!

OK, not really; there is still more we can learn. But you have to admit that was pretty easy. If you can identify a service that returns the data you want in structured form, web scraping becomes a pretty trivial enterprise. We’ll discuss several other scenarios and topics, but for some web scraping tasks this is really all you need to know.

Organizing and saving the data

The records we retrieved from https://www.harvardartmuseums.org/browse are arranged as a list of dictionaries. With only a little trouble we can select the fields of interest and arrange these data into a pandas DataFrame. First let’s see what fields are available.

## dict_keys(['info', 'records'])
## dict_keys(['accessionyear', 'technique', 'mediacount', 'edition', 'totalpageviews', 'groupcount', 'people', 'objectnumber', 'colorcount', 'lastupdate', 'rank', 'imagecount', 'description', 'dateoflastpageview', 'dateoffirstpageview', 'primaryimageurl', 'colors', 'dated', 'contextualtextcount', 'copyright', 'period', 'accessionmethod', 'url', 'provenance', 'images', 'publicationcount', 'objectid', 'culture', 'verificationleveldescription', 'standardreferencenumber', 'worktypes', 'department', 'state', 'markscount', 'contact', 'titlescount', 'id', 'title', 'verificationlevel', 'division', 'style', 'commentary', 'relatedcount', 'datebegin', 'labeltext', 'totaluniquepageviews', 'dimensions', 'exhibitioncount', 'techniqueid', 'dateend', 'creditline', 'imagepermissionlevel', 'signed', 'periodid', 'century', 'classificationid', 'medium', 'peoplecount', 'accesslevel', 'classification', 'seeAlso'])
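
Output like the two `dict_keys` lines above is what calling `.keys()` on the decoded response, and then on its first record, would print. Here is a minimal sketch using a small hand-made stand-in for the decoded JSON; the name `collections_data` and the contents of `info` are assumptions, not the workshop's actual variables or data.

```python
# Hand-made stand-in for the dictionary returned by response.json();
# the real response has many more fields per record.
collections_data = {
    "info": {"page": 1},
    "records": [
        {"id": 296710, "title": "Three Views of a Military Drummer", "culture": "French"},
        {"id": 199108, "title": "Mullein Pink", "culture": "Dutch"},
    ],
}

# Top-level keys of the response
print(collections_data.keys())

# Keys of the first record, i.e. the fields available for each object
first_record = collections_data["records"][0]
print(first_record.keys())
```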

Next we can specify the fields we are interested in and use a dict comprehension to organize the values:
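
A dict comprehension for this step might look like the sketch below. The `records` list is a hand-made stand-in with values taken from the output shown earlier, and the selection in `fields` is only illustrative.

```python
# Hand-made stand-in for the list of record dictionaries
records = [
    {"id": 296710, "title": "Three Views of a Military Drummer",
     "culture": "French", "dated": "c. 1713"},
    {"id": 199108, "title": "Mullein Pink",
     "culture": "Dutch", "dated": "1680"},
]

# Fields we want to keep (an illustrative selection)
fields = ["id", "title", "culture"]

# One list of values per field; .get() yields None when a record lacks the field
records_by_field = {
    field: [record.get(field) for record in records]
    for field in fields
}

print(records_by_field["title"])
```

Using `.get()` rather than indexing means a record with a missing field contributes `None` instead of raising a `KeyError`, which keeps all the lists the same length.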

Finally we can convert the dict to a DataFrame

##    accessionyear technique              ...                classification                        seeAlso
## 0           2000   Etching              ...                        Prints  [{'id': 'https://iiif.harv...
## 1           2008      None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 2           2000  Monotype              ...                        Prints  [{'id': 'https://iiif.harv...
## 3           1978      None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 4           2002    Struck              ...                         Coins  [{'id': 'https://iiif.harv...
## 5           1898      None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 6           1941      None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 7           1978      None              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 8           1964      None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 9           1999      None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 
## [10 rows x 61 columns]

and write the data to a file.
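
The conversion and the write can be sketched as follows, assuming pandas is installed. The dict of lists is a hand-made stand-in for the one described above, and the file name `records.csv` is arbitrary.

```python
import os
import tempfile

import pandas as pd

# Hand-made stand-in for the dict of lists described above
records_by_field = {
    "id": [296710, 199108],
    "title": ["Three Views of a Military Drummer", "Mullein Pink"],
    "culture": ["French", "Dutch"],
}

# Each key becomes a column, each list position a row
records_df = pd.DataFrame(records_by_field)
print(records_df)

# Write to a .csv file; index=False leaves out the row numbers.
# A temporary directory is used here only to keep the example tidy.
csv_path = os.path.join(tempfile.mkdtemp(), "records.csv")
records_df.to_csv(csv_path, index=False)

# Read the file back to confirm the round trip
print(pd.read_csv(csv_path).shape)
```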

Iterating to retrieve all the data

Of course we don’t want just the first page of collections. How can we retrieve all of them?

Now that we know the web service works, and how to make requests in Python, we can iterate in the usual way.
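
One way to iterate is to build one set of query parameters per page and make a GET request for each. The sketch below is an illustration, not the workshop's code: the helper `fetch_pages` is hypothetical, and the parameter names `load_amount` and `offset` are assumptions that should be checked against the requests your browser actually makes.

```python
import requests


def fetch_pages(url, page_params, get=requests.get):
    """Make one GET request per parameter dict and return the decoded JSON bodies."""
    pages = []
    for params in page_params:
        response = get(url, params=params)
        response.raise_for_status()  # stop early on an HTTP error status
        pages.append(response.json())
    return pages


# Hypothetical paging parameters: five pages of ten records each
page_params = [{"load_amount": 10, "offset": offset} for offset in range(0, 50, 10)]
print(page_params[:2])

# With network access, something like the following would retrieve the pages:
# pages = fetch_pages("https://www.harvardartmuseums.org/browse", page_params)
```

The `get` argument defaults to `requests.get`; passing a different callable makes it possible to try the loop without touching the network.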

For convenience we can flatten the records in each list into one long records list
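
Flattening a list of lists is a one-line nested comprehension. A minimal sketch, with made-up placeholder records standing in for the per-page results:

```python
from itertools import chain

# Stand-in for the 'records' list taken from each page's response
records_per_page = [
    [{"id": 1}, {"id": 2}],
    [{"id": 3}],
    [{"id": 4}, {"id": 5}],
]

# Nested comprehension: outer loop over pages, inner loop over each page's records
all_records = [record for page in records_per_page for record in page]

# itertools.chain.from_iterable gives the same result
assert all_records == list(chain.from_iterable(records_per_page))

print(len(all_records))
```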

As before, we can write the data to a .csv file without too much difficulty:

##     accessionyear                      technique              ...                classification                        seeAlso
## 0          2000.0                        Etching              ...                        Prints  [{'id': 'https://iiif.harv...
## 1          2008.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 2          2000.0                       Monotype              ...                        Prints  [{'id': 'https://iiif.harv...
## 3          1978.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 4          2002.0                         Struck              ...                         Coins  [{'id': 'https://iiif.harv...
## 5          1898.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 6          1941.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 7          1978.0                           None              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 8          1964.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 9          1999.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 10         2000.0              Chromogenic print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 11         1949.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 12         2008.0                        Woodcut              ...                        Prints  [{'id': 'https://iiif.harv...
## 13         1943.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 14         1943.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 15         2005.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 16         1984.0           Gelatin silver print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 17            NaN                 Relief etching              ...                        Prints  [{'id': 'https://iiif.harv...
## 18         2008.0              Chromogenic print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 19         1999.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 20         1929.0                     Lithograph              ...                        Prints  [{'id': 'https://iiif.harv...
## 21         1995.0                         Carved              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 22         1923.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 23         2006.0                         Relief              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 24         1928.0                         Carved              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 25            NaN                      Engraving              ...                        Prints  [{'id': 'https://iiif.harv...
## 26            NaN           Gelatin silver print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 27         2008.0                           None              ...                      Drawings  [{'id': 'https://iiif.harv...
## 28         1994.0                           None              ...                       Vessels  [{'id': 'https://iiif.harv...
## 29         2004.0                     Lithograph              ...                        Prints  [{'id': 'https://iiif.harv...
## 30         1943.0                           None              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 31         1985.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 32         2011.0           Gelatin silver print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 33         2007.0           Gelatin silver print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 34         2000.0                      Engraving              ...                        Prints  [{'id': 'https://iiif.harv...
## 35         1940.0            Chiaroscuro woodcut              ...                        Prints  [{'id': 'https://iiif.harv...
## 36         2011.0           Albumen silver print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 37         2009.0                           None              ...                   Photographs  [{'id': 'https://iiif.harv...
## 38         2009.0  Etching, softground etchin...              ...                        Prints  [{'id': 'https://iiif.harv...
## 39         2011.0              Chromogenic print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 40         1951.0                           None              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 41         2007.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 42         2009.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 43         2001.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 44         2011.0           Gelatin silver print              ...                   Photographs  [{'id': 'https://iiif.harv...
## 45         1943.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 46         1906.0                           None              ...                     Paintings  [{'id': 'https://iiif.harv...
## 47         1975.0         Cast, lost-wax process              ...                     Sculpture  [{'id': 'https://iiif.harv...
## 48         2006.0                     Lithograph              ...                        Prints  [{'id': 'https://iiif.harv...
## 49         2010.0                   Photogravure              ...                   Photographs  [{'id': 'https://iiif.harv...
## 
## [50 rows x 61 columns]

Exercise: Retrieve exhibits data

In this exercise you will retrieve information about the art exhibitions at Harvard Art Museums from https://www.harvardartmuseums.org/visit/exhibitions

  1. Using a web browser (Firefox or Chrome recommended) inspect the page at https://www.harvardartmuseums.org/visit/exhibitions. Examine the network traffic as you interact with the page. Try to find where the data displayed on that page comes from.
  2. Make a GET request in Python to retrieve the data from the URL identified in step 1.
  3. Write a loop or list comprehension in Python to retrieve data for the first 5 pages of exhibitions data.
  4. Bonus (optional): Arrange the data you retrieved into a dict of lists. Convert it to a pandas DataFrame and save it to a .csv file.

Parse html if you have to

As we’ve seen, you can often inspect network traffic or other sources to locate the source of the data you are interested in and the API used to retrieve it. You should always start by looking for these shortcuts and using them where possible. If you are really lucky, you’ll find a shortcut that returns the data as JSON or XML. If you are not quite so lucky, you will have to parse HTML to retrieve the information you need.

For example, when I inspected the network traffic while interacting with https://www.harvardartmuseums.org/visit/calendar I didn’t see any requests that returned JSON data. The best we can do appears to be https://www.harvardartmuseums.org/visit/calendar?date=, which unfortunately returns HTML.

Retrieving HTML

The first step is the same as before: we make a GET request.

## 'https://www.harvardartmuseums.org/visit/calendar'

As before we can check the headers to see what type of content we received in response to our request.
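
In requests, the response headers are available as `response.headers`, a case-insensitive mapping. The sketch below checks the `Content-Type` value. To avoid a live request it uses a hand-built headers object as a stand-in, and the helper name `media_type` is hypothetical.

```python
from requests.structures import CaseInsensitiveDict


def media_type(headers):
    """Return the media type from a Content-Type header, without parameters such as charset."""
    return headers.get("Content-Type", "").split(";")[0].strip().lower()


# Stand-in for response.headers; a real requests response exposes
# the same kind of case-insensitive mapping
headers = CaseInsensitiveDict({"content-type": "text/html; charset=UTF-8"})

print(media_type(headers))
```

A result of `text/html` tells us the body needs an HTML parser, whereas `application/json` would mean `response.json()` is enough.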

Parsing HTML using the lxml library

Like JSON, HTML is structured; unlike JSON it is designed to be rendered into a human-readable page rather than simply to store and exchange data in a computer-readable format. Consequently, parsing HTML and extracting information from it is somewhat more difficult than parsing JSON.

While JSON parsing is built into the Python requests library, parsing HTML requires a separate library. I recommend using the HTML parser from the lxml library; others prefer an alternative called Beautiful Soup.
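
As a minimal sketch of parsing with lxml, assuming the library is installed: the page text below is hand-written and stands in for the text of a real response (with requests you would pass `response.text`). The element names and the id are illustrative only.

```python
from lxml import html

# Small hand-written page standing in for the text of an HTTP response
page_text = """
<html>
  <body>
    <h1>Calendar</h1>
    <div id="events_list">
      <div class="event"><h3>Midday Organ Recital</h3></div>
      <div class="event"><h3>Gallery Talk</h3></div>
    </div>
  </body>
</html>
"""

# Parse the text into a tree of elements
page = html.fromstring(page_text)

# Elements can be searched and their text extracted
headings = page.findall(".//h3")
print([h.text_content() for h in headings])
```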

Using xpath to extract content from HTML

XPath is a tool for identifying particular elements within an HTML document. The developer tools built into modern web browsers make it easy to generate XPaths that can be used to identify the elements of a web page that we wish to extract.

We can open the HTML document we retrieved and inspect it using our web browser.

Once we identify the element containing the information of interest we can use our web browser to copy the XPath that uniquely identifies that element.

Next we can use Python to extract the element of interest:
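
A sketch of this step with lxml follows. The HTML is hand-written, and the id `events_list` and class `event` are assumptions about the page structure; on the real page the XPath would be whatever the browser's developer tools report.

```python
from lxml import html

page = html.fromstring("""<html><body>
  <div id="sidebar"><p>Hours and admission</p></div>
  <div id="events_list">
    <div class="event"><h3>Midday Organ Recital</h3></div>
    <div class="event"><h3>Gallery Talk</h3></div>
  </div>
</body></html>""")

# XPath of the kind a browser's "Copy XPath" produces for an element with an id
events_xpath = '//*[@id="events_list"]'

# xpath() always returns a list, even for a single match
events_list = page.xpath(events_xpath)[0]

# The children of the selected element are the individual events
events = events_list.xpath('./div[@class="event"]')
print(len(events))
```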

Once again we can use a web browser to inspect the HTML we’re currently working with, and to figure out what we want to extract from it. Let’s look at the first element in our events list.

As before we can use our browser to find the xpath of the elements we want.

(Note that the html.open_in_browser function adds enclosing html and body tags in order to create a complete web page for viewing. This requires that we adjust the xpath accordingly.)

By repeating this process for each element we want, we can build a list of the xpaths to those elements.

Finally, we can iterate over the elements we want and extract them.

## {'date': 'Thursday, November 1, 2018',
##  'figcaption': '1958 D. A. Flentrop organ, Adolphus Busch Hall, Harvard '
##                'University.',
##  'localtion1': '29 Kirkland Street',
##  'location2': 'Cambridge',
##  'time': '12:15pm - 12:45pm',
##  'title': 'Midday Organ Recital: Beth Elswick'}
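
A dictionary of this shape could come from a loop like the one sketched below. The event fragment is hand-written, and the names and XPaths in `event_xpaths` are hypothetical; on a real page they would be copied from the browser's developer tools.

```python
from lxml import html

# Hand-written fragment standing in for one event from the page
event = html.fromstring("""<div class="event">
  <h3>Midday Organ Recital: Beth Elswick</h3>
  <div class="when">
    <span class="date">Thursday, November 1, 2018</span>
    <span class="time">12:15pm - 12:45pm</span>
  </div>
</div>""")

# Hypothetical XPaths, relative to the event element
event_xpaths = {
    "title": "./h3",
    "date": './div/span[@class="date"]',
    "time": './div/span[@class="time"]',
}

# For each name, take the first matching element and keep its text
event_info = {
    name: event.xpath(path)[0].text_content().strip()
    for name, path in event_xpaths.items()
}
print(event_info)
```

Note that indexing with `[0]` raises an `IndexError` when an XPath matches nothing, so this version only works when every element is present.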

Iterating to retrieve content from a list of HTML elements

So far we’ve retrieved information only for the first event. To retrieve data for all the events listed on the page we need to iterate over the events. If we are very lucky, each event will have exactly the same information structured in exactly the same way and we can simply extend the code we wrote above to iterate over the events list.

Unfortunately not all these elements are available for every event, so we need to take care to handle the case where one or more of these elements is not available. We can do that by defining a function that tries to retrieve a value and returns an empty string if it fails.

Armed with this function we can iterate over the list of events and extract the available information for each one.
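
The function and the loop described above might be sketched as follows. The helper name `get_xpath_text`, the sample HTML, and the XPaths are all hypothetical; the point is only the pattern of catching the failed lookup and substituting an empty string.

```python
from lxml import html


def get_xpath_text(element, path):
    """Return the text of the first match for path, or '' if nothing matches."""
    try:
        return element.xpath(path)[0].text_content().strip()
    except IndexError:
        # xpath() returned an empty list: this event lacks the element
        return ""


# Hand-written stand-in for the events list; the second event has no caption
events_list = html.fromstring("""<div id="events_list">
  <div class="event"><h3>Midday Organ Recital</h3><p class="caption">Flentrop organ</p></div>
  <div class="event"><h3>Gallery Talk</h3></div>
</div>""")

# Hypothetical XPaths relative to each event element
event_xpaths = {"title": "./h3", "figcaption": './p[@class="caption"]'}

# One dictionary per event, with '' wherever an element is missing
all_event_values = [
    {name: get_xpath_text(event, path) for name, path in event_xpaths.items()}
    for event in events_list.xpath('./div[@class="event"]')
]
print(all_event_values)
```

Catching only `IndexError` keeps other problems, such as a malformed XPath expression, visible instead of silently turning them into empty strings.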

For convenience we can arrange these values in a pandas DataFrame and save them as .csv files, just as we did with our exhibitions data earlier.

##                        figcaption                          date    ...              localtion1  location2
## 0   1958 D. A. Flentrop organ,...    Thursday, November 1, 2018    ...      29 Kirkland Street  Cambridge
## 1   Rhyton forepart in the for...      Friday, November 2, 2018    ...      224 Western Avenue    Allston
## 2                                      Friday, November 2, 2018    ...        32 Quincy Street  Cambridge
## 3   Deer head rhyton depicting...    Saturday, November 3, 2018    ...        32 Quincy Street  Cambridge
## 4   Donkey head kantharos (dri...    Saturday, November 3, 2018    ...        32 Quincy Street  Cambridge
## 5   Octagonal cup with the for...    Saturday, November 3, 2018    ...        32 Quincy Street  Cambridge
## 6                                    Saturday, November 3, 2018    ...        32 Quincy Street  Cambridge
## 7                                    Saturday, November 3, 2018    ...        32 Quincy Street  Cambridge
## 8                                      Sunday, November 4, 2018    ...        32 Quincy Street  Cambridge
## 9                                      Sunday, November 4, 2018    ...        32 Quincy Street  Cambridge
## 10                                    Tuesday, November 6, 2018    ...        32 Quincy Street  Cambridge
## 11                                    Tuesday, November 6, 2018    ...        32 Quincy Street  Cambridge
## 12  Donkey head kantharos (dri...   Wednesday, November 7, 2018    ...        32 Quincy Street  Cambridge
## 13                   © Nic Lehoux   Wednesday, November 7, 2018    ...        32 Quincy Street  Cambridge
## 14       Corinne Wasmuht, German,   Wednesday, November 7, 2018    ...        32 Quincy Street  Cambridge
## 15  1958 D. A. Flentrop organ,...    Thursday, November 8, 2018    ...      29 Kirkland Street  Cambridge
## 16    Théodore Géricault, French,    Thursday, November 8, 2018    ...        32 Quincy Street  Cambridge
## 17                                     Friday, November 9, 2018    ...        32 Quincy Street  Cambridge
## 18                                  Saturday, November 10, 2018    ...        32 Quincy Street  Cambridge
## 19                                  Saturday, November 10, 2018    ...        32 Quincy Street  Cambridge
## 20                                    Sunday, November 11, 2018    ...        32 Quincy Street  Cambridge
## 21                                    Sunday, November 11, 2018    ...        32 Quincy Street  Cambridge
## 22     Gordon W. Gahan, American,     Monday, November 12, 2018    ...        32 Quincy Street  Cambridge
## 23   Charles Bird King, American,    Tuesday, November 13, 2018    ...        32 Quincy Street  Cambridge
## 24                                 Wednesday, November 14, 2018    ...        32 Quincy Street  Cambridge
## 25  Donkey head kantharos (dri...  Wednesday, November 14, 2018    ...        32 Quincy Street  Cambridge
## 26  Ram head mug depicting sym...  Wednesday, November 14, 2018    ...        32 Quincy Street  Cambridge
## 27                                 Wednesday, November 14, 2018    ...        32 Quincy Street  Cambridge
## 28  1958 D. A. Flentrop organ,...   Thursday, November 15, 2018    ...      29 Kirkland Street  Cambridge
## 29                                  Thursday, November 15, 2018    ...        32 Quincy Street  Cambridge
## 30  Timothy H. O’Sullivan, Ame...     Friday, November 16, 2018    ...        32 Quincy Street  Cambridge
## 31                                    Friday, November 16, 2018    ...        32 Quincy Street  Cambridge
## 32  Bell krater depicting a sy...     Friday, November 16, 2018    ...      224 Western Avenue    Allston
## 33    Théodore Géricault, French,   Saturday, November 17, 2018    ...        32 Quincy Street  Cambridge
## 34                                  Saturday, November 17, 2018    ...        32 Quincy Street  Cambridge
## 35                                  Saturday, November 17, 2018    ...        32 Quincy Street  Cambridge
## 36                                    Sunday, November 18, 2018    ...        32 Quincy Street  Cambridge
## 37                                    Sunday, November 18, 2018    ...        32 Quincy Street  Cambridge
## 38  Eagle head mug. Attributed...  Wednesday, November 21, 2018    ...        32 Quincy Street  Cambridge
## 39         Harry Annas, American,   Thursday, November 22, 2018    ...        32 Quincy Street  Cambridge
## 40                                   Tuesday, November 27, 2018    ...        32 Quincy Street  Cambridge
## 41           Photo: Danny Hoshino    Tuesday, November 27, 2018    ...        32 Quincy Street  Cambridge
## 42  Octagonal cup with the for...  Wednesday, November 28, 2018    ...        32 Quincy Street  Cambridge
## 43  Rhyton with the forepart o...  Wednesday, November 28, 2018    ...        32 Quincy Street  Cambridge
## 44  1958 D. A. Flentrop organ,...   Thursday, November 29, 2018    ...      29 Kirkland Street  Cambridge
## 45                                  Thursday, November 29, 2018    ...        32 Quincy Street  Cambridge
## 46                                    Friday, November 30, 2018    ...        32 Quincy Street  Cambridge
## 47                                    Friday, November 30, 2018    ...        32 Quincy Street  Cambridge
## 
## [48 rows x 6 columns]

Exercise: parsing HTML

In this exercise you will retrieve information about the physical layout of the Harvard Art Museums. The web page at https://www.harvardartmuseums.org/visit/floor-plan contains this information in HTML form.

  1. Using a web browser (Firefox or Chrome recommended) inspect the page at https://www.harvardartmuseums.org/visit/floor-plan. Copy the XPath to the element containing the list of level information. (HINT: the element of interest is a ul, i.e., an unordered list.)
  2. Make a GET request in Python to retrieve the web page at https://www.harvardartmuseums.org/visit/floor-plan. Extract the content from the response object and parse it using html.fromstring from the lxml library.
  3. Use your web browser to find the XPaths to the facilities housed on level one. Use Python to extract the text from those XPaths.
  4. Bonus (optional): Write a loop or list comprehension in Python to retrieve data for all the levels.
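
Before working against the live page, it can help to try the same parsing pattern on a small, known piece of HTML. The sketch below is not a solution to the exercise: the SAMPLE_PAGE markup, the "levels" class name, and the parse_levels helper are all made up for illustration, and the real floor-plan page will almost certainly be structured differently. It only shows how html.fromstring and relative XPath queries fit together.

```python
from lxml import html

# Hypothetical stand-in for a downloaded page. The real floor-plan
# markup will differ; inspect it in your browser to find the actual XPaths.
SAMPLE_PAGE = """
<html><body>
  <ul class="levels">
    <li><h3>Level 1</h3><ul><li>Shop</li><li>Cafe</li></ul></li>
    <li><h3>Level 2</h3><ul><li>Galleries</li></ul></li>
  </ul>
</body></html>
"""


def parse_levels(page_source):
    """Return a dict mapping each level name to a list of its facilities."""
    tree = html.fromstring(page_source)
    levels = {}
    # Select every top-level list item in the (assumed) list of levels.
    for item in tree.xpath('//ul[@class="levels"]/li'):
        # XPaths starting with "./" are evaluated relative to this item.
        name = item.xpath('./h3/text()')[0].strip()
        facilities = [text.strip() for text in item.xpath('./ul/li/text()')]
        levels[name] = facilities
    return levels


levels = parse_levels(SAMPLE_PAGE)
print(levels)
```

With a real page you would pass the text of the response returned by requests.get to the same kind of function, after replacing the XPaths with the ones you copied from your browser.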

Use Scrapy for large or complicated projects

Scraping websites using the requests library to make GET and POST requests, and the lxml library to process HTML, is a good way to learn basic web scraping techniques. It is a good choice for small- to medium-sized projects. For very large or complicated scraping tasks the scrapy library offers a number of conveniences, including asynchronous retrieval, session management, convenient methods for extracting and storing values, and more. More information about scrapy can be found at https://doc.scrapy.org.

Use a browser driver as a last resort

It is sometimes necessary (or sometimes just easier) to use a web browser as an intermediary rather than communicating directly with a web service. This method has the advantage of being able to use the JavaScript engine and session management features of a web browser; the main disadvantage is that it is slower and tends to be more fragile than using requests or scrapy to make requests directly from Python. For small scraping projects involving complicated sites with CAPTCHAs or lots of complicated JavaScript, using a browser driver can be a good option. More information is available at https://www.seleniumhq.org/docs/03_webdriver.jsp.